Abstract

Background: LINE-1 (L1) is the dominant category of transposable elements in placental mammals. L1 has 
significantly affected the size and structure of all mammalian genomes and understanding the nature of the 
interactions between L1 and its mammalian host remains a question of crucial importance in comparative 
genomics. For this reason, much attention has been dedicated to the evolution of L1. Among the most studied 
elements is the mouse L1 which has been the subject of a number of studies in the 1980s and 1990s. These 
seminal studies, performed in the pre-genomic era when only a limited number of L1 sequences were available, 
have significantly improved our understanding of L1 evolution. Yet, no comprehensive study on the evolution of L1 
in mouse has been performed since the completion of this genome sequence. 

Results: Using the Genome Parsing Suite we performed the first evolutionary analysis of mouse L1 over the entire 
length of the element. This analysis indicates that the mouse L1 has recruited novel 5'UTR sequences more 
frequently than previously thought and that the simultaneous activity of non-homologous promoters seems to be 
one of the conditions for the co-existence of multiple L1 families or lineages. In addition the exchange of genetic 
information between L1 families is not limited to the 5'UTR as evidence of inter-family recombination was observed 
in ORF1, ORF2, and the 3'UTR. In contrast to the human L1, there was little evidence of rapid amino-acid 
replacement in the coiled-coil of ORF1, although this region is structurally unstable. We propose that the structural 
instability of the coiled-coil domain might be adaptive and that structural changes in this region are selectively 
equivalent to the rapid evolution at the amino-acid level reported in the human lineage. 

Conclusions: The pattern of evolution of L1 in mouse shows some similarity with human suggesting that the 
nature of the interactions between L1 and its host might be similar in these two species. Yet, some notable 
differences, particularly in the evolution of ORF1, suggest that the molecular mechanisms involved in host-L1 
interactions might be different in these two species. 
Background

Long interspersed nuclear element-1 (LINE-1 or L1) constitutes the
dominant category of transposable elements in mammalian genomes. L1s
have accumulated in the genomes of their mammalian hosts in extremely
large numbers and contribute to more than 20% of genome size in human
and mouse [1,2]. L1s have been a rich source of evolutionary novelties
by providing motifs that can be recruited by the host either for the
regulation of its own genes or within its coding sequences [3-6].
However, L1 activity can also be detrimental to the fitness of the
host [7,8], either by inserting within genes [9,10] or by mediating
chromosomal rearrangements through ectopic (non-allelic) recombination
[11,12]. L1 elements replicate using a copy-and-paste mechanism that
involves the reverse transcription of the L1 RNA at the insertion site
[13-15]. L1 encodes the replicative machinery necessary for the
retrotransposition reaction. It contains two open reading frames
(ORFs) that are both indispensable for L1 retrotransposition. ORF1
encodes a trimeric protein with RNA-binding properties and
nucleic-acid chaperone activity [16-20]. ORF2 encodes an endonuclease
that makes the first nick at the insertion site and a reverse
transcriptase that copies L1 RNA into DNA at the site of insertion
[21,22]. L1 has a 5' untranslated region (UTR) that acts as an
internal promoter [23,24] and a 3' UTR with a conserved poly-G tract
of unknown function [25]. The L1 retrotransposition reaction produces
mostly 5' truncated elements that are transpositionally inactive
[26,27]. As the vast majority of L1 insertions do not serve a function
for the host, they accumulate mutations at the neutral rate so that
young families of L1 elements are less divergent than older ones
[28-32].

The pattern of L1 element evolution in mammals is very unusual. In
most species analyzed so far, L1 evolves as a single lineage: a family
of elements emerges, amplifies to hundreds or thousands of copies and
then becomes extinct, being replaced by a more recently evolved family
[30,33-35]. This process is exemplified in human where a single
lineage of replicatively dominant families has evolved over the last
40 MY [30]. The reason(s) why L1 evolves as a single lineage remains
unclear but the similarity between L1 and H3N2 influenza A virus
evolution [36-38] suggests that the single lineage mode of evolution
could result from a co-evolutionary arms race between L1 and its host.
This hypothesis is supported by the observation that the coiled-coil
domain of ORF1 harbors the signature of adaptive evolution, possibly
in response to host repression [39], and that adaptive evolution
apparently correlates with the replicative success of L1 families
[30]. However, in early primate evolution (from 70 to 40 MY ago),
multiple L1 lineages co-existed in the human genome [30].
Interestingly, co-existing lineages always had non-homologous 5'UTRs,
suggesting that their co-existence could be due to their reliance on
different host factors for their transcription.

The patterns described above result mostly from the analysis of the
human genome and it is unclear how well patterns of evolution in human
recapitulate L1 evolution in other species. It is thus important to
examine in greater detail the evolution of L1 lineages in other
mammals. Pre-genomics studies in the house mouse (Mus musculus) have
demonstrated the presence of multiple concurrently active L1 families
with non-homologous promoters [33,40-48]. Recently active families are
classified into two groups based on their promoter types (A or F
types), whereas ancestral L1 families carry a third promoter, the V
type. The co-existence of multiple L1 families with different
promoters in extant mice recapitulates the situation in early primate
evolution and provides a unique opportunity to investigate the
interactions between concurrent L1 families and the molecular
properties that would allow for such co-existence.

Previous L1 studies in mice were limited to sequence analysis
performed on a few L1 loci, the majority of which were fragments of L1
inserts. No detailed study of L1 evolution in mouse has been performed
since the completion of the mouse genome sequence [2]. With the
availability of this genome, we decided to perform a comprehensive
analysis of full-length L1 elements to investigate the evolutionary
dynamics of L1 in mouse. We present evidence that the diversification
of mouse L1 has been influenced by frequent events of recombination
across the entire length of the element, rapid structural changes in
ORF1, as well as lateral transfer by inter-specific hybridization.

Results 

A total of 20,459 L1 inserts with complete reverse transcriptase (RT)
domains were identified using the Genome Parsing Suite (GPS). L1
elements were first grouped based on their 5'UTR. This was done by
comparing the 5' end of all elements with a library of previously
described mouse 5'UTRs using the RepeatMasker program [49]. The A, F,
V, and Lx 5'UTR types have long been characterized [33,50,51] and the
majority of elements could be assigned to one of these 5'UTR
sequences. A number of elements, however, carried 5'UTRs distinct from
these four types. These elements were aligned to each other and
grouped into three novel types of 5'UTR: (1) a 5'UTR with similarity
to the F type but with distinctive features, named Fanc (for F
ancestral); (2) a 5'UTR that was not characterized before, named Mus
(because it is absent from the rat genome); and (3) a 5'UTR that shows
no similarity with any others, named N (for novel).

Once elements were sorted based on their 5'UTRs, they were further
categorized into families using a phylogenetic analysis of the 3'
terminus. A family is defined as a collection of elements that result
from the activity of a highly homogeneous group of progenitors, which
are characterized by a unique combination of characters. In the first
step of the phylogenetic analysis, neighbor joining trees [52] of
elements sharing similar 5'UTRs were built. Distinct clusters of
sequences were provisionally considered families and were validated by
a second round of phylogenetic analysis based on the principle that
elements belonging to the same family should yield a star phylogeny
(that is, a phylogenetic tree devoid of structure) because these
elements result from the activity of very similar progenitors. These
families were further confirmed by phylogenetic analysis performed on
other regions of L1 to ensure that the homogeneity of the families
extends over the entire length of the element.
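The notion of a "highly homogeneous" family can be made concrete with a simple divergence measure. The sketch below computes the mean pairwise p-distance (proportion of differing sites) among aligned elements. It is only an illustration: the function names and the toy sequences are hypothetical, and the divergences reported in Table 1 were obtained with the maximum composite likelihood method in MEGA, not with this uncorrected distance.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites; columns with a gap or N are skipped."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def mean_pairwise_divergence(seqs: list[str]) -> float:
    """Average p-distance over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment standing in for members of one putative family (hypothetical data).
family = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC", "ACGTAC-TAC"]
print(f"mean pairwise divergence: {100 * mean_pairwise_divergence(family):.2f}%")
```

A low and uniform value across all pairs is what one would expect from a star phylogeny; strongly bimodal pairwise distances would instead hint at substructure within the cluster.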

Using this approach we identified 29 families and consensus sequences
were derived for each of them (Table 1, Additional file 1, and
Additional file 2). The number of variable sites in ORF1, ORF2, and
the 3'UTR is 1,441 (25.1% of the total number of sites), 991 (17.2%)
of which are parsimony-informative. The number of variable sites
differs among regions, ORF2 having the largest number (785 out of
3,835 sites) followed by the 3'UTR (324 out of 652) and ORF1 (318 out
of 1,218). However, ORF2 has the smallest proportion of variable and
parsimony-informative sites relative to its length (20.5% and 13.9%,
respectively) and the 3'UTR the largest (49.7% and 32.5%), ORF1 being
intermediate (26.1% and 19.2%). The length of the consensus varies
between 6,000 and 8,000 bp, depending on the number of monomer repeats
in the promoter region. The number of full-length (FL) elements varies
greatly between families, as FL elements belonging to older families
tend to be less numerous than those of younger families. This is
expected as L1 inserts decay over time because of internal deletions.
The copy number of a few older families was too low (<10 copies) to
derive accurate FL consensus sequences. Such families were removed
from the dataset as we maintained a strict rule of using only FL
elements, that is, elements with intact 5'UTR, ORF1, ORF2, and 3'UTR.
Thus our dataset represents relatively high copy number families which
have inserted in the mouse genome since the split between mouse and
rat, about 13 MY ago [53]. It is very likely that additional ancient,
small copy number families exist but were missed by our approach.

Table 1 Copy number, divergence, and age of mouse L1 families

| Family (a) | RepeatMasker classification | Promoter type | LPR structure | Genomic copy number (b) | Number of FL elements | Average pairwise divergence (% ± S.E.) (c) | Age (Myr) (d) |
|---|---|---|---|---|---|---|---|
| L1MdA_I | L1MdA | A | 66-42 | 4,249 | 1,620 | 0.376 ± 0.073 | 0.21 (0.17-0.25) |
| L1MdA_II | L1MdA | A | 66-42-42 | 5,156 | 1,240 | 2.939 ± 0.294 | 1.62 (1.45-1.78) |
| L1MdA_III | L1MdA | A | 66-42-42 | 4,337 | 606 | 3.916 ± 0.304 | 2.15 (1.99-2.32) |
| L1MdA_IV | L1MdF2 | A | 66-42-42 | 1,209 | 645 | 4.346 ± 0.414 | 2.39 (2.16-2.62) |
| L1MdA_V | L1MdF3 | A | 66-42-42 | 945 | 299 | 5.167 ± 0.341 | 2.84 (2.65-3.03) |
| L1MdA_VI | L1MdF3 | A | 66-66 | 5,497 | 219 | 8.554 ± 0.434 | 4.70 (4.47-4.94) |
| L1MdA_VII | L1MdF2 | A | 66-66 | 5,684 | 759 | 8.346 ± 0.414 | 4.59 (4.36-4.82) |
| Tf_I | L1Md_T | F | 66-42-42 | 5,601 | 1,593 | 0.462 ± 0.095 | 0.25 (0.20-0.31) |
| Tf_II | L1Md_T | F | 66-42-42 | | 1,282 | 0.496 ± 0.087 | 0.27 (0.22-0.32) |
| Tf_III | L1Md_T | F | 66-42-42 | 4,678 | 1,892 | 2.233 ± 0.196 | 1.23 (1.12-1.34) |
| Gf_I | L1Md_F, L1Md_T | F | 66-42-42-42 | 2,177 | 615 | 1.356 ± 0.250 | 0.75 (0.61-0.88) |
| Gf_II | L1Md_T | F | 66-66-66 | 770 | 368 | 3.929 ± 0.421 | 2.16 (1.93-2.39) |
| L1MdF_I | L1MdF2 | F | 66-42-42 | 5,112 | 1,209 | 3.853 ± 0.278 | 2.12 (1.97-2.27) |
| L1MdF_II | L1MdF2 | F | 66-42-42 | | 609 | 4.537 ± 0.271 | 2.50 (2.35-2.64) |
| L1MdF_III | L1MdF2 | F | 66-66 | | 548 | 8.040 ± 0.400 | 4.42 (4.20-4.64) |
| L1MdF_IV | L1MdF2 | F | 66-42-42 | 6,179 | 964 | 11.627 ± 0.503 | 6.39 (6.12-6.67) |
| L1MdF_V | L1VL1, L1MdF2 | F | 66-42 | 3,936 | 884 | 11.683 ± 0.487 | 6.43 (6.16-6.69) |
| L1MdFanc_I | L1Md_F, L1_Mus1 | Fanc | 66-42 | 4,398 | 418 | 12.366 ± 0.610 | 6.80 (6.47-7.14) |
| L1MdFanc_II | L1_Mus2 | Fanc | 66-66-66 | 16,491 | 460 | 16.795 ± 0.821 | 9.24 (8.79-9.69) |
| L1MdN_I | L1VL1, L1Md_F, L1Md_F3 | N | 66-42-42 | 2,237 | 367 | 3.447 ± 0.212 | 1.90 (1.78-2.01) |
| L1MdV_I | L1VL1, L1_Mus1 | V | 45-66 | 5,777 | 318 | 15.257 ± 0.647 | 8.39 (8.04-8.75) |
| L1MdV_II | L1_Mus3 | V | 66 | 3,848 | 470 | 18.318 ± 0.855 | 10.07 (9.60-10.55) |
| L1MdV_III | Lx | V | 66-66 | NA | N/A | 17.575 ± 0.968 | 9.67 (9.13-10.20) |
| L1MdMus_I | L1_Mus1 | Mus | 66-66-42-56 | 4,947 | 535 | 12.068 ± 0.590 | 6.64 (6.31-6.96) |
| L1MdMus_II | L1_Mus2 | Mus | 66-66 | 1,924 | 304 | 14.971 ± 0.521 | 8.23 (7.95-8.52) |
| L1Lx_I | L1_Mus3 | Lx | 66-66 | 1,649 | 384 | 19.864 ± 0.846 | 10.93 (10.46-11.39) |
| L1Lx_II | L1_Mus4 | Lx | 66-66 | 3,546 | 186 | 23.907 ± 0.998 | 13.15 (12.60-13.70) |
| L1Lx_III | L1_Mus4 | Lx | 66-66 | 3,667 | 193 | 18.595 ± 0.841 | 10.23 (9.76-10.69) |
| L1Lx_IV | Lx | Lx | 66-66 | NA | N/A | 25.642 ± 1.237 | 14.10 (13.42-14.78) |

(a) Family names based on the RepeatMasker database.
(b) The genomic copy numbers of Tf_I and II and of F_I, II, and III were combined due to the small number of diagnostic characters at the 3' end.
(c) Average pairwise divergences were calculated using the maximum composite likelihood method (MEGA 4.0 package).
(d) Dates were calculated assuming a substitution rate of 1.1% / Myr.
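The distinction between variable and parsimony-informative sites used above can be illustrated with a short sketch. A column is counted as variable when it contains at least two different nucleotides, and as parsimony-informative when at least two nucleotides each occur in at least two sequences. The function name and the toy alignment are hypothetical; the per-region counts in `REPORTED` are the ones quoted in the text, and the sketch simply recomputes the corresponding percentages.

```python
from collections import Counter


def classify_sites(alignment: list[str]) -> tuple[int, int]:
    """Return (variable, parsimony_informative) column counts.

    Only A, C, G and T are counted; gaps and ambiguity codes are ignored.
    """
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("sequences must have equal length")
    variable = informative = 0
    for column in zip(*(s.upper() for s in alignment)):
        counts = Counter(c for c in column if c in "ACGT")
        if len(counts) >= 2:
            variable += 1
            # Informative: at least two states each seen in two or more sequences.
            if sum(1 for n in counts.values() if n >= 2) >= 2:
                informative += 1
    return variable, informative


# Toy alignment of four consensus sequences (hypothetical data).
toy = ["ACGTAC", "ACGTAT", "ATGTAT", "ATGAAC"]
print(classify_sites(toy))

# Variable-site counts and region lengths as quoted in the text.
REPORTED = {"ORF2": (785, 3835), "3'UTR": (324, 652), "ORF1": (318, 1218)}
for region, (n_variable, length) in REPORTED.items():
    print(f"{region}: {100 * n_variable / length:.1f}% variable")
```

Normalizing by region length is what reverses the ranking: ORF2 has the most variable sites in absolute terms but the lowest proportion.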

Phylogenetic analysis of L1 families based on ORF2 

As L1 families have extensively recombined with each other (see
below), various regions of L1 yield different evolutionary histories
and it is impossible to build a single phylogenetic tree based on the
entire length of the element. Figure 1 shows the tree built using the
longest non-recombining segment of ORF2 (2.5 kb). This segment
recapitulates the evolutionary history of L1 lineages more faithfully
than other regions because it has not recruited older sequences that
would have distorted its evolution. In addition, the branching order
on this tree is generally consistent with the age of the families
(Table 1), so that older families are closer to the base of the tree
and younger families appear more derived. The most recently active
families, the L1MdA lineage (characterized by an A promoter) and the
L1MdTf lineage (characterized by an F promoter), cluster into well
supported paraphyletic and monophyletic lineages, respectively. Each
of these lineages contains three families, namely L1MdA_I, II, and III
and L1MdTf_I, II, and III. We also identified two families that could
be classified as L1MdGf, based on similarity with a previously
described family [43]. However, these two families (provisionally
named L1MdGf_I and II) do not form a monophyletic group as L1MdGf_I
appears more closely related to L1MdTf and L1MdGf_II groups with
L1MdA families. The branch leading to this group of active and
recently active families is composed of four families with an A
promoter (L1MdA_IV to VII) and the only family carrying the N promoter
(L1MdN_I). These families evolved from a group of sequences carrying
an F promoter (L1MdF_IV and V). Families L1MdF_I, II, and III
constitute a lineage that evolved independently and in parallel with
the main A lineage. The F lineage possibly evolved from a family which
was carrying a V promoter and which appears to be the last active
family with this promoter type. This family in turn evolved from a
family carrying the Mus promoter, which apparently evolved from a
family carrying the Fanc promoter (L1MdFanc_II). At the same time two
families branched independently from the main lineage, one carrying a
Mus promoter (L1MdMus_I), the other one the Fanc promoter
(L1MdFanc_I). Preceding the L1MdFanc_II family, a lineage made of four
families with an Lx promoter was active. At two points in time the Lx
promoter was replaced by the V promoter (yielding L1MdV_II and III)
but these families did not persist or produce novel lineages.

Figure 1 Phylogenetic tree of mouse L1 families based on the longest
non-recombining region of ORF2, including the reverse transcriptase
domain. This segment corresponds to the region between nucleotide 2095
and 4489 on the alignment provided as supplementary material. The tree
was built using the maximum-likelihood method with the HKY+G model.
The numbers indicate the percentage of time the labeled node was
present in 1,000 bootstrap replicates of the data. Red arrows indicate
the acquisition of a new 5'UTR.

One of the most striking features visible on the tree is that families
with similar 5'UTRs do not form monophyletic groups, indicating that
L1 families have frequently recruited novel 5'UTRs, either from
unknown sources or from ancient families. The oldest families in our
study carried an Lx promoter, which was replaced three times: once by
the Fanc promoter (L1MdFanc_II) and twice by the V promoter (L1MdV_II
and III). The Fanc promoter was replaced independently twice by the
Mus promoter, as L1MdMus_I and L1MdMus_II do not form a monophyletic
group. The Mus promoter was eventually replaced by the V promoter
(L1MdV_I) and went extinct. The F promoter was then resuscitated
approximately 6.4 MY ago and gave rise to families L1MdF_I to V.
Approximately 4.6 MY ago the A promoter was recruited, yielding the
modern A lineage which extends from families L1MdA_VII to I. Within
this lineage, an additional recruitment occurred resulting in the
L1MdN_I family. Finally the F promoter was recently recruited twice,
approximately 2.2 MY ago by the L1MdGf_II family and approximately 1.2
MY ago by the Tf/Gf_I lineage. Thus we estimate that L1 in mouse has
experienced 11 replacements of 5'UTR.
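The dates quoted above follow from the average pairwise divergences in Table 1. As a rough sketch, the tabulated ages and their ranges can be reproduced by multiplying the divergence (and the divergence plus or minus one S.E.) by 0.55 Myr per percent. This factor is inferred here from the values in Table 1 and is an assumption of the sketch; the table footnote itself only states that a substitution rate of 1.1% / Myr was assumed. The function name is hypothetical.

```python
# Conversion factor inferred from Table 1 (age divided by divergence).
# This is an assumption of the sketch, not a formula stated in the text.
MYR_PER_PERCENT = 0.55


def age_myr(divergence_pct: float, se_pct: float) -> tuple[float, float, float]:
    """Return (age, lower, upper) in Myr from divergence and S.E. in percent."""
    age = divergence_pct * MYR_PER_PERCENT
    lower = (divergence_pct - se_pct) * MYR_PER_PERCENT
    upper = (divergence_pct + se_pct) * MYR_PER_PERCENT
    return age, lower, upper


# Divergence ± S.E. values copied from Table 1 for the families dated in the text.
table1 = {
    "L1MdF_IV": (11.627, 0.503),
    "L1MdA_VII": (8.346, 0.414),
    "Gf_II": (3.929, 0.421),
    "Tf_III": (2.233, 0.196),
}

for family, (div, se) in table1.items():
    age, lo, hi = age_myr(div, se)
    print(f"{family}: {age:.2f} Myr ({lo:.2f}-{hi:.2f})")
```

For these four families the computed values agree with the ages listed in Table 1 to two decimal places, which correspond to the approximate dates of 6.4, 4.6, 2.2, and 1.2 MY given in the text.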

The topology of the ORF2 tree indicates that mouse L1 families evolved
mostly as a single lineage. This does not mean that a single family or
single lineage was active at a time. In fact, the co-existence of
multiple active families characterizes the evolution of L1 for the
last 13 MY of mouse evolution. For instance, between 1 and 2.5 MY ago,
six families (L1MdTf_III, L1MdA_II, L1MdA_III, L1MdGf_II, L1MdN_I, and
L1MdF_I) were active in the mouse genome as attested by the overlap in
their average pairwise divergence (Table 1). In some cases, several
families evolved into lineages that diversified and co-existed with
the dominant lineage for several MY. The lineage composed of L1MdF_I,
II, and III is the one that co-existed the longest with the lineage
that yielded the currently active families. L1MdF_I was active 2.12 MY
ago, at about the same time as families L1MdA_III and L1MdN_I. These
families, however, are all descendants of family L1MdF_IV which was
active 6.4 MY ago (Figure 1 and Table 1). Thus the lineage consisting
of L1MdF_I, II, and III co-existed with the lineage that produced
L1MdA_III and L1MdN_I for more than 4 MY. Eventually the L1MdF lineage
became extinct. Thus the cascade structure of the ORF2 tree, typical
of the single lineage mode of evolution reported in other mammals, is
consistent with a model in which multiple families are concurrently
active until one of them attains replicative supremacy, coinciding
with the extinction of its competitors.

Detection of recombination among murine L1 families 

Because L1 families have frequently recruited novel promoters we
decided to examine if L1 lineages have exchanged genetic information
in other regions of the element. To this end, several methods
implemented in the RDP 3.0 software were used: two substitution-based
approaches, MaxChi [54] and Chimera [55], and two phylogenetic
approaches, Bootscan [56] and RDP [57]. Breakpoints and statistically
significant events of genetic recombination detected by RDP were
verified by visual inspection of the FL consensus alignment (see
Additional file 3) and phylogenetic analyses. A minimum of six
recombination events was detected.
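To give an intuition for how a substitution-based approach can locate a breakpoint, the sketch below implements the core idea of a maximum chi-square scan on two aligned sequences: for each candidate breakpoint, the numbers of differing and identical sites on either side are arranged in a 2x2 table, and the position maximizing the chi-square statistic is reported. This is a deliberately simplified illustration with hypothetical function names. It is not the RDP implementation, which differs in how sites are selected and how significance is assessed, and no significance test is performed here.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def max_chi_square(seq1: str, seq2: str, min_flank: int = 3):
    """Scan breakpoints between two aligned sequences.

    Returns (best_statistic, best_position), where the position is counted in
    gap-free alignment columns, or (0.0, None) if no breakpoint stands out.
    """
    # True where the two sequences differ; columns containing a gap are dropped.
    diffs = [x != y for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    best_stat, best_pos = 0.0, None
    for k in range(min_flank, len(diffs) - min_flank + 1):
        left, right = diffs[:k], diffs[k:]
        a, b = sum(left), len(left) - sum(left)      # differing / identical, left
        c, d = sum(right), len(right) - sum(right)   # differing / identical, right
        stat = chi_square_2x2(a, b, c, d)
        if stat > best_stat:
            best_stat, best_pos = stat, k
    return best_stat, best_pos


# Toy example: identical over the first 10 sites, different over the last 10,
# as expected if the left part was acquired from the other sequence's lineage.
parent = "AAAAAAAAAA" + "CCCCCCCCCC"
mosaic = "AAAAAAAAAA" + "GGGGGGGGGG"
print(max_chi_square(parent, mosaic))
```

In practice a candidate breakpoint found this way would still need the kind of confirmation described in the text, namely inspection of the alignment and phylogenetic analysis of the segments on either side.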

Starting with the most recent events, the L1MdTf and L1MdGf families
were the result of three independent recombination events between
L1MdA_III and L1MdF families. Analyses of non-recombinant segments
spanning ORF1 and the 5' end of ORF2 indicate that both Tf (Figure 2B)
and Gf (Figure 2C) families are nested within the more ancestral L1MdF
lineage. However, the topology derived from the region spanning the
central section of ORF2 suggests that Tf and Gf are descendants of the
L1MdA family. The recombination events that produced these families
occurred independently as the recombination breakpoints are different.
The breakpoints for the two Gf families lie towards the 5' end of ORF2
but are approximately 30 bp apart (see Additional files), reflecting
two independent events of recombination, a conclusion supported by the
considerable number of differences between L1MdGf_I and L1MdGf_II in
ORF1 (see below). Based on differences in ORF1 we determined that
L1MdGf_II could result from a recombination event between L1MdF_III
and L1MdA_III, and L1MdGf_I from recombination between L1MdF_I or II
and L1MdA_III. The three L1MdTf families result from recombination
between L1MdF_II and L1MdA_III, but the breakpoint for the Tf families
is located approximately 700 bp downstream from the breakpoints
detected in the Gf families. This breakpoint is shared among the three
Tf families, suggesting the recombination event occurred at the origin
of the Tf lineage.

Figure 2 Evidence for recombination between mouse L1 families. (A)
Schematic structure of an L1 element (5'UTR, ORF1, ORF2, 3'UTR); (B)
Recombinant origin of the Tf families; (C) Independent recombinant
origin of the Gf_I and Gf_II families; (D) Evidence for recombination
at the ORF2-3'UTR junction; (E) Evidence for the transfer of the
coiled coil domain from Mus_II to A_VI, A_VII, F_III, and Gf_II. The
numbers in parentheses correspond to the position of the fragments
used to build the tree relative to the alignment provided as
supplementary material and beginning at position 1 of ORF1.

The next oldest recombination event is between the ancestor of
L1MdA_IV (which is the ancestor of L1MdA_I, II, and III) and L1MdF_II,
near the 3' end of the element (Figure 2D). A 666 bp region was
transferred from L1MdF_II to the L1MdA_IV family. This fragment is
also found in all L1MdA sequences derived from L1MdA_IV as well as the
Gf and Tf families since they also acquired their ORF2 and 3'UTR from
an ancestral L1MdA family. Similarly, a segment located in the
coiled-coil domain of ORF1 was transferred from L1MdMus_II to
L1MdA_VII and L1MdA_VI (Figure 2E). Subsequently an overlapping region
was transferred from L1MdA_VII or L1MdA_VI to L1MdF_III. This segment
is also found in L1MdGf_II as this family got its ORF1 from L1MdF_III.

It should be noted that our criteria for identifying recombination
events were stringent, as we only considered the recombination of
large segments to be significant. Thus it is plausible that exchanges
of sequences of shorter length have occurred between L1 families but
were not detected due to the small number of defining characters in
some conserved regions of L1, such as ORF2. The number of
recombination events reported here suggests that recombination has
played a significant role in the evolution of novel L1 families in
mouse and can occur across the entire length of L1.

The exchange of genetic information between families constitutes a
significant challenge for evolutionary analyses as most phylogenetic
algorithms do not allow for recombination. Thus we performed
phylogenetic analyses using regions of L1 delimited by recombination
breakpoints to fully assess the impact of recombination on the
evolutionary history of FL L1 elements (Figure 3). Trees A and B are
based on the coiled coil domain of ORF1 and the 3' half of ORF1
through the 5' end of ORF2, respectively. The main difference between
the ORF2 tree and tree B is that recently active families with similar
5'UTRs form monophyletic groups: families L1MdA_I to VI cluster
together and families L1MdF_I, II, and III, Tf_I, II, and III, and
Gf_I and II group together (tree B on Figure 3). Further upstream in
the coiled coil domain (tree A on Figure 3) this monophyly vanishes
because of the transfer of the coiled-coil motif from L1MdMus_II to
L1MdA_VI, L1MdA_VII, L1MdGf_II, and L1MdF_III. Tree C is based on the
3' terminus of ORF2 and the 5' end of the 3'UTR. The main difference
with the ORF2 tree is the position of all families that are
descendants of family L1MdA_IV (that is, L1MdA_I to III, the Tf, and
the Gf families). These families appear closer to families L1MdF_I to
III than to families L1MdA_V to VII because of the transfer of this
segment from L1MdF_II to L1MdA_IV. Further downstream, the tree based
on the 3' terminus of L1 (tree D) lacks resolution because of the
short length of the sequence analyzed and the small number of
characters differentiating the families. The main difference with tree
C is the position of family L1MdGf_II, which branches outside a
monophyletic group composed of families L1MdTf, L1MdGf_I, and L1MdA_I
to IV, consistent with the independent origin of this recombinant
family.

Evolution of the ORFs 

We then examined the evolution of the two protein coding sequences
encoded by L1, ORF1 and ORF2. ORF2 is the most conserved region of L1.
There are very few amino acid changes, in particular in the
endonuclease and reverse transcriptase domains which are functionally
indispensable [21,58]. All the methods we used to assess the impact of
selection on ORF2 indicate that this region is evolving under strong
purifying selection, that is, selection against amino acid changes
(Table 2). We analyzed separately the 5' and 3' termini of ORF2
because of the presence of recombination. In both regions, the PARRIS
method found no evidence that a subset of amino acids is evolving
under positive selection and estimated a mean dN/dS of 0.308 and
0.229, for the 5' and 3' termini, respectively. Similarly, the values
of dN/dS estimated by the GABranch method were all significantly lower
than 1. In addition, two of the three methods used to detect selection
at specific amino acids (SLAC and REL) failed to find evidence of
positive selection, although they identified a large number of amino
acids under negative selection (not shown). The FEL method identified
two amino acids that could have evolved under positive selection but
as these two residues have not been recovered by the two other
methods, it is likely they constitute false positives.
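The logic behind dN/dS rests on separating nucleotide changes that alter the encoded amino acid (nonsynonymous) from those that do not (synonymous). The sketch below only classifies codon differences between two aligned, in-frame coding sequences using the standard genetic code. It does not compute dN/dS, which additionally requires normalizing by the numbers of synonymous and nonsynonymous sites, and the methods named in the text estimate these rates with considerably more elaborate models. The function name and example sequences are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_changes(seq1: str, seq2: str) -> tuple[int, int]:
    """Count (synonymous, nonsynonymous) codon differences.

    Only codon pairs differing at exactly one position are classified;
    codons with gaps or ambiguous bases, and multi-hit codons, are skipped.
    """
    syn = nonsyn = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if sum(x != y for x, y in zip(c1, c2)) != 1:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy example: Met-Leu-Lys-Gly versus Met-Leu-Arg-Gly with two silent changes.
print(count_changes("ATGCTGAAAGGT", "ATGCTAAGAGGC"))
```

Under purifying selection, as reported for ORF2, nonsynonymous changes are removed more often than synonymous ones, so the normalized ratio falls well below 1.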

Figure 3 Phylogenetic trees of mouse L1 families based on (A) the
coiled coil domain, (B) the 3' end of ORF1 and the 5' terminus of
ORF2, (C) the 3' terminus of ORF2 and the 5' end of the 3'UTR and (D)
the 3' terminus of the 3'UTR. The trees were built with the
maximum-likelihood method using the JC (tree A), TN93+G (tree B),
HKY+G (tree C) or T92 (tree D) models. The numbers indicate the
percentage of time the labeled node was present in 1,000 bootstrap
replicates of the data. The numbers in parentheses correspond to the
position of the fragments used to build the tree relative to the
alignment provided as supplementary material and beginning at position
1 of ORF1.

We examined the level of conservation of domains of ORF1 that are
known to be functionally important [19,59,60]. Three domains have been
identified: a coiled coil (CC) domain that mediates the formation of
ORF1p trimers, an RNA-recognition motif (RRM), and a C-terminal domain
(CTD). The 3' half of ORF1, which contains the RRM and CTD domains, as
well as approximately the first 50 amino acids of ORF1 are very
conserved across families, in contrast with the CC domain that shows a
high level of structural variation. We analyzed independently the 5'
terminus, the CC domain, and the 3' half of ORF1 for evidence of
selection using recombination breakpoints as boundaries. All the
methods used strongly indicated that the 5' terminus and the 3' half
of ORF1 are evolving under purifying selection. The PARRIS method
rejected the hypothesis that a subset of amino acids is evolving under
positive selection and the GABranch method showed that dN/dS has
remained significantly lower than 1 in these regions during the entire
evolutionary span covered by the analysis.



Table 2 Summary of selection detection tests

| ORF | Region | Mean dN/dS | Number of branches with positive selection | SLAC | REL |
|---|---|---|---|---|---|
| ORF1 | 5' terminus | 0.494 ± 0.275 | 0 | | |
| ORF1 | Coiled coil | 0.608 ± 0.401 | | | 8,089 |
| ORF1 | 3' terminus | 0.354 ± 0.371 | 0 | | 348,351 |
| ORF2 | 5' terminus (1-1,170) | 0.308 ± 0.411 | | | |
| ORF2 | 3' terminus (1,171-end) | 0.229 ± 0.353 | 0 | | |

This is not surprising, especially for the 3' half of ORF1, as the RRM
and CTD motifs were shown to be conserved across mammals [60]. The
SLAC, FEL, and REL programs failed to identify a single amino acid
under positive selection at the 5' end. In the 3' half, the REL method
identified two amino acids under positive selection but these residues
are likely to be false positives as the changes in amino acid result
from independent events of mutation at CpG dinucleotides, which are
known for their unusually high mutation rate.

More surprising is the degree of conservation at the amino acid level
of the CC domain. Previous studies have shown that the CC domain of
ORF1 has evolved under positive selection in primates [30,39]. In the
case of the mouse, surprisingly, the PARRIS method rejected the
hypothesis that some amino acids evolved under positive selection,
although a moderately high dN/dS ratio was obtained (0.608), and the
GABranch method failed to identify a single branch in the evolution of
the coiled coil with a dN/dS >1. Out of the three methods (SLAC, FEL,
and REL) used to detect selection at specific amino acids, only one
(REL) identified two amino acids that could have evolved under
positive selection. It is thus plausible that these two sites are
false positives as they have been identified by a single method. Even
if these sites are evolving under positive selection, it remains true
that the signature of positive selection in the mouse CC is much
weaker than it is in human [30,39].

Although the CC domain is relatively conserved at the
amino acid level, it shows a high level of structural
variation. Previous studies have identified a region called the
length polymorphism region (LPR) [33,61]. Using our FL
consensus alignments we were able to reconstruct the
complex history of this region (depicted in Figure 4).
The ancestral state is found in the oldest families (Lx_I,
Lx_II, Lx_III, Lx_IV, and L1MdMus_II) and contains
two 66 bp repeats. From this ancestral motif, four
independent modifications have occurred: the loss of the
second 66 bp repeat in L1MdV_II, a 21 bp deletion in the
first 66 bp repeat found in the L1MdV_I family, a
duplication of the second repeat resulting in three 66 bp
repeats in L1MdFanc_II, and a 24 bp deletion in the
second repeat found in L1MdFanc_I and L1MdF_IV. The
resulting 66-42 bp motif was followed by a duplication of the 42
bp unit resulting in a 66-42-42 bp structure which is
found in families L1MdA_V to II, L1MdN_I, L1MdTf_III
to I, and L1MdF_I, II, and V. This motif further evolved
by the loss of the second 42 bp repeat in L1MdA_I and
L1MdF_IV and by the addition of a third 42 bp unit in
family L1MdGf_I. The ancestral 66-66 bp motif was
recruited by recombination in families L1MdF_III,
L1MdA_VI, and VII, and acquired a third 66 bp unit in
family L1MdGf_II. These structural changes in the LPR
resulted in changes in the length and structure of the
CC. Coiled coils are formed from two or more α-helical
peptide chains that contain a distinct arrangement of
non-polar side chains. Domains that can form CC consist
of heptads (or seven residue repeats) with non-polar
or hydrophobic residues in the first and fourth positions
[62]. The CC in L1 plays an important role in holding
together the dumbbell-shaped ORF1p trimers [18]. The
shortest CC domain is 66 amino acids long and contains
seven heptads (based on predictions using the program
COILS) in family L1MdV_I. The longest CC is 111
amino acids long and contains 12 heptads in family
L1MdGf_I. Between these two extremes, families with 8,
10, and 11 heptads were found.

Figure 4 Evolution of the length polymorphic region of ORF1 in mouse. The blue boxes correspond to the 66 bp motifs and the orange
box, the 42 bp motifs. The position of the polymorphic region on a full-length L1 element is displayed in the bottom right of the figure.
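
The repeat arithmetic described for the LPR can be checked with a short sketch. The arrangement labels below are descriptive names chosen for illustration, not identifiers used by the authors; only the unit sizes (66, 42, 21, and 24 bp) come from the text. The sketch computes the LPR length only, not the length of the whole CC domain.

```python
# Illustrative only: LPR arrangements as lists of repeat-unit lengths (bp).
# Labels are hypothetical, descriptive names; unit sizes are taken from the text.
LPR_VARIANTS = {
    "ancestral 66-66": [66, 66],
    "single 66 (second repeat lost)": [66],
    "21 bp deletion in first repeat": [66 - 21, 66],
    "66-66-66": [66, 66, 66],
    "66-42 (24 bp deletion in second repeat)": [66, 66 - 24],
    "66-42-42": [66, 42, 42],
    "66-42-42-42": [66, 42, 42, 42],
}


def lpr_length(units):
    """Total length of an LPR arrangement in base pairs."""
    return sum(units)


def in_frame(units):
    """True if the arrangement length is a multiple of three."""
    return lpr_length(units) % 3 == 0


for label, units in LPR_VARIANTS.items():
    bp = lpr_length(units)
    print(f"{label}: {bp} bp, {bp // 3} codons, in frame: {in_frame(units)}")
```

Every unit size and deletion listed is a multiple of three, so each arrangement changes the number of codons without shifting the ORF1 reading frame.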

Evidence for the lateral transfer of L1 families

Finally, we examined the possibility of lateral transfer in
the evolution of murine L1. In mammals, L1 is transmitted
vertically and there is no evidence of lateral transfer
[63], except in cases of inter-specific hybridization.
Inter-specific hybridization has previously been described
among mice of the genus Mus and it has been proposed
that some L1 families in the house mouse genome were
acquired by hybridization [44,64,65]. In order to detect
hybridization we used a phylogenetic approach: if an L1
family is invading a genome through hybridization, long
branches might be expected with a lack of intermediate
sequences on a tree built using genomic copies. In contrast,
under the strict vertical mode of transmission,
intermediate sequences would be expected between all
families. We built a tree using the 3'UTR of a large
number of genomic copies representative of the most
recently active families (Figure 5). Two cases of long
branches with no intermediate sequences were found:
one leading to the L1MdTf_I and II families, and the
other leading to L1MdGf_I. This analysis suggests that
the L1MdGf_II and L1MdTf_III families evolved within
the house mouse genome but that the L1MdTf_I and II
and the L1MdGf_I families were acquired through
inter-specific hybridization. We can also infer that these
transfers resulted from two independent hybridization events
since the two Tf families amplified approximately 0.25
MY ago whereas L1MdGf_I amplified approximately
0.75 MY ago.

Discussion 

We performed the first comprehensive analysis of L1
evolution since the completion of the mouse genome
[2]. The analysis is limited to the most recently active L1
families and covers approximately the last 13 MY of 
mouse evolution. As murine rodents evolve approxi- 
mately eight times faster than hominoids, the amount of 
evolutionary change investigated here is similar to previ- 
ous studies in humans that covered more than 80 MY of 
primate evolution [30,35]. The results are consistent 
with the large number of analyses performed in the pre- 
genomic era [32,33,41-45,50,65-68] but, by focusing 
solely on intact FL elements, we were able to provide for 
the first time a complete picture of the evolution of 
mouse L1 families over the entire length of the element.

Evolution of L1 as a single lineage

The evolution of L1 in mouse fits the single lineage
mode of evolution described previously in other mammals
and particularly in human [30,35,63,69]. This is
exemplified by the similarity between the tree in Figure 1
and the tree based on the human ORF2 (Figure 2 in
[30]). This model is based on the observation that L1
phylogenies have a typical cascade structure that is best
explained by the successive activity of L1 families: a single
family, or a group of closely related families, is active
at a given point in time until a new family emerges and
replaces the pre-existing family, which usually becomes
extinct. In some instances, however, several lineages may
co-exist until one eventually becomes extinct. This is the
case of the L1MdF_I, II, and III lineage which co-existed
with the dominant lineage for approximately 4 MY and
of the Tf and L1MdA_I, II, and III lineages that
co-existed for about 2 MY and are still active in the mouse
genome. In ancestral primates a similar situation occurred
but over a much longer period of evolutionary time
as the L1PB and L1PA lineages co-existed for 30 MY
[30]. We previously observed that, in human, L1 lineages
that co-exist for extended periods always have different
promoter sequences. We proposed that families with different
promoter sequences rely on different host factors
for their transcription and consequently do not rely
on the same host-encoded resources [30]. This situation
allows them to co-exist as they are not using the same
genomic 'niche'. In mouse the same observation can be
made. The lineage composed of L1MdF_I, II, and III
co-existed with the main lineage when the latter was dominated
by families carrying the A promoter (L1MdA_III
to VI). Similarly, the two lineages that are currently active,
the L1MdA_I, II, and III and the L1MdTf/Gf, carry
different, non-homologous 5'UTRs. Thus, it is possible
that the conditions that allow for multiple lineages to
co-exist are the same in mouse and in human. Unlike
modern human, where a single family is currently active
(the Ta family) [28], the modern house mouse genome
harbors several families with different 5'UTRs and
consequently presents an excellent model to test experimentally
the hypothesis that the activity of different 5'UTRs is
one of the conditions for the co-existence of families
and lineages.

Figure 5 Phylogeny of genomic copies showing lateral transfer
of the L1MdTf_I, L1MdTf_II, and L1MdGf_I families. The tree was
built using the neighbor joining method based on Kimura 2-parameter
distance. The long branches suggestive of lateral
transfers are indicated with black arrows. In contrast the L1MdGf_II
and L1MdTf_III families, as well as the three L1MdA families, are not
separated from other sequences by long branches indicating they
have evolved from older families within the mouse genome. The
sequences used to build this tree were randomly chosen within each
of the recently active families in the mouse genome. When other
sequences are selected, the topology of the tree remains the same.
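
The Kimura 2-parameter distance named in the Figure 5 legend can be sketched as follows. This is a minimal illustration of the standard formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions; it is not the code used to build the tree. The function name and the choice to skip gapped or ambiguous sites are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a gap or ambiguous base are skipped
    (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For highly diverged pairs the logarithm's argument can become non-positive, in which case `math.log` raises an error; the distance is undefined at that level of saturation.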

Acquisition and exchange of sequence during L1 evolution

The analysis of FL elements has revealed the extraordinary
ability of L1 families to acquire novel motifs and to
exchange sequences (Figures 2 and 3). The recruitment
of novel 5'UTR sequences [30,33] as well as the recombinant
nature of some L1 families in mouse [45,46] and
rat [34,69,70] have long been described. Three mechanisms
have been proposed to account for the mosaic nature
of some families. First, recombination between
genomic copies, that is at the level of DNA templates,
could result in the formation of a novel transpositionally
competent family. This hypothesis has been discounted
on the basis that it is highly unlikely that a chance
recombination event between two replicatively competent
elements occurred, whereas recombination between any of
the hundreds of thousands of L1 pseudogenes, the majority
of which have suffered inactivating mutations,
is much more likely to produce an inactive element
[69]. Second, recombination could occur at the time
the L1 RNA is reverse-transcribed and could result from
the formation of an RNA/DNA heteroduplex between the
L1 RNA and a genomic copy at the insertion site [71].
This model is supported by the observation that the
recruitment of novel motifs seems to be directional as it is
always a chronologically young 3' end that recruits an
older 5' terminus [69]. Third, mosaic elements could be
produced if the L1-encoded reverse transcriptase
switches RNA strand at the time of insertion. Polymerase
strand-switching is a well-known feature of RNA
viruses [72,73]. This mechanism ensures that recombination
occurs between replicatively competent elements,
that is elements that carry a 5'UTR capable of driving
their transcription. The third model predicts that
recombination occurs only between families that are
simultaneously active whereas the first and second models do
not have such a requirement. We found that the exchange
of genetic information occurs both between simultaneously
active families and by resuscitation of motifs
from extinct families. For instance, the coiled-coil domain
of L1MdMus_II was recruited by L1MdA_VII
about 4.6 MY ago, long after the extinction of L1MdMus_II
which was active 8.23 MY ago. The L1MdGf_II family
is also the product of a recombination between two
families that were not active simultaneously, the
L1MdF_III and the L1MdA_III families (which amplified
4.42 and 2.15 MY ago, respectively). All other instances
of recombination occurred between families that were
simultaneously active, which is consistent with the
polymerase strand-switching model. Similarly, the acquisition
of novel 5'UTRs tends to result from the transfer of
5' termini between families that were active at the same
time. This is exemplified by the evolution of the F-type
5'UTR, which was transferred from L1MdFanc_I (active 6.80 MY
ago) to the ancestor of L1MdF_V (at 6.43 MY) and
subsequently transferred from L1MdF_I (active 2.12 MY
ago) to the recently active L1MdTf and L1MdGf
families.

Evolution of ORF1

The first ORF is arguably the least understood region of 
L1, although it has been the subject of much attention
in the past few years [17-20,59,60,74-78]. Its secondary 
structure has been resolved as a dumbbell shape result- 
ing from the formation of a trimeric structure mediated 
by the coiled coil domain [18]. It is established that it 
has RNA-binding abilities, mediated by the RRM, can 
act as a nucleic acid chaperone [19,20] and form multi- 
mers in the presence of nucleic acids [78]. Previous 
studies have shown that the 3' half of ORF1 is very con- 
served [60] and our analysis confirms this is the case in 
mouse. In contrast, studies in human have demonstrated 
that the coiled-coil domain is evolving under strong 
positive selection as indicated by the high values of dN/ 
dS reported in the evolution of this region [30,39]. Such 
a rapid evolution at the amino-acid level is certainly 
adaptive and it was proposed that this was the result of 
an arms race between L1 and its human host. This hypothesis
was further supported by the fact that periods
of adaptive evolution in the coiled coil coincide with
periods of intense L1 activity [30]. However, we failed to
find strong evidence of adaptive evolution in the mouse 
coiled coil. In contrast we found an extraordinary level 
of structural instability in this region (Figure 4), unex- 
pected in a protein coding region critical for the multi- 
meric structure of the functional protein. Instability in 
this region has also been described in the rat L1
suggesting a common role for these structural changes
in these two species [34,69]. Structural changes in the 
coiled coil occur so frequently that it is tempting to 
speculate that they are adaptive, and are evolutionarily 
equivalent to periods of intense amino acid replacement 
in humans. 

Conclusions 

We performed a comprehensive analysis of L1 evolution
in mouse. This analysis covered the last 13 MY of mouse
evolution, since the split between mouse and rat. The
mouse L1 has evolved as a single lineage for most of its
evolution, although co-existence between families carrying
different promoter sequences was observed. L1 families
have frequently acquired novel 5'UTRs and have
exchanged sequences over the entire length of the element.
No evidence of rapid amino acid replacement in
ORF1 was detected, although it is likely that the structural
instability of the CC domain is adaptive. The general
pattern of evolution of mouse L1 is similar to the
one in human, suggesting that the nature of the interactions
between L1 and its host might be similar in these
two species. There are however some intriguing differences
between mouse and human, particularly in the
evolution of ORF1. These differences suggest that the
molecular mechanisms involved in host-L1 interactions
might be different in these two species.

Methods 

Collection and classification of full-length L1 elements

Full-length (FL) elements were collected from the Mus
musculus 2006 (mm8) genome build using the GPS [79].
GPS conducted a BLAST-type search (WU-tBLASTn) of
the genome using the conserved Reverse Transcriptase
(RT) domain of ORF2 as a query. GPS then cut 7,000 bp
upstream and downstream of the RT domain yielding a
14,000 bp fragment. A second WU-tBLASTn was then
performed on the 14,000 bp cutouts to identify regions
characteristic of L1 (ORF1, the endonuclease domain of
ORF2, the RT domain, and the 3'UTR). In this analysis,
GPS did not search for sequence identity at the 5' end
since L1 is known to frequently recruit novel sequences
as 5'UTR [30,33]. Thus, a file containing 3,000 bp
upstream of ORF1 was generated for further analyses. The
FL sequences were first sorted based on their 5'UTRs.
Once elements were sorted based on their 5'UTRs, they
were further categorized into families using a phylogenetic
analysis of the 3' terminus. A family is defined as a
collection of elements that result from the activity of a
highly homogeneous group of progenitors, which are
characterized by a unique combination of characters. In
the first step of the phylogenetic analysis, neighbor joining
trees [52] of elements sharing similar 5'UTRs were
built. Distinct clusters were provisionally considered
families and were validated by a second round of phylogenetic
analysis based on the principle that elements
belonging to the same family should yield a star phylogeny
because they result from the activity of similar
progenitors. These families were further confirmed by
phylogenetic analysis performed on other regions of L1
to verify that the homogeneity of the families extends
over the entire length of the element. Full-length consensus
sequences were derived for each family and are
available on Repbase. Phylogenetic analyses were performed
using the neighbor joining (NJ) method [52]
based on the maximum composite likelihood
distance included in the MEGA 5.01 software package
[80]. The model that best fits the data was determined
for each alignment using MEGA. The robustness of each
phylogenetic tree was assessed using a bootstrap procedure
with 1,000 replicates. Families were named by the
name of the 5' promoter (A, F, Fanc, V, Lx, Mus, or N;
see Results) followed by a Roman numeral. The smaller the
Roman numeral, the younger the family is. For instance
families L1MdA_I, L1MdA_II, and L1MdA_III are subsets
of the previously described L1MdA family; family
L1MdA_I is younger than family L1MdA_II and family
L1MdA_III is the oldest of the three. We kept the Gf
[43] and Tf [42] names for the recently active Tf and Gf
families because these names have been widely used in
the literature.
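
The cutout step described above can be sketched as a coordinate calculation. This is not GPS code: the function name, the 0-based half-open coordinate convention, the choice to centre the window on the midpoint of the RT hit, and the clipping at chromosome ends are all assumptions made for illustration. Only the 7,000 bp flank size comes from the text.

```python
def cutout_window(hit_start, hit_end, chrom_length, flank=7000):
    """Return (start, end) of a window of up to 2 * flank bp around an RT hit.

    Coordinates are 0-based and half-open. The window is centred on the
    midpoint of the hit (an assumption of this sketch) and clipped to the
    chromosome boundaries.
    """
    if not (0 <= hit_start < hit_end <= chrom_length):
        raise ValueError("hit coordinates out of range")
    centre = (hit_start + hit_end) // 2
    start = max(0, centre - flank)
    end = min(chrom_length, centre + flank)
    return start, end
```

A hit far from either chromosome end yields the full 14,000 bp fragment, while a hit near an end yields a shorter one, which a downstream filter for full-length elements would need to handle.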

Analysis of FL elements 

NJ, maximum parsimony (MP), and maximum likelihood
(ML) trees were calculated for each region of L1.
Phylogenetic trees were reconstructed using the MEGA
5.01 package [80]. The RDP3.0 program (Recombination
Detection Program 3.0, available at http://darwin.uvigo.es/rdp/rdp.html)
was used to search for evidence of
recombination among families. RDP allows for the
use of several recombination detection methods includ- 
ing substitution and phylogeny-based methods. Two 
substitution-based methods, MaxChi [54] and Chimaera 
[55], as well as a phylogenetic method, bootscan [56], 
were used to analyze the datasets. The RDP software 
also includes its own unique algorithm termed 'RDP' 
[57] which is also a phylogenetic approach to detecting 
recombination. A window size of 50 bp was used to de- 
tect breakpoints between consensus sequences. Statisti- 
cally significant events of recombination were verified by 
comparing phylogenetic trees on each side of the puta- 
tive breakpoint. 
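
The idea behind a substitution-based scan such as MaxChi can be illustrated with a simplified sketch. This is not the RDP3 implementation: it compares two sequences only, uses every aligned site rather than variable sites alone, and omits the permutation test used to assess significance. The function names are hypothetical, and splitting the 50 bp window into two 25 bp halves is an assumption.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def max_chi_scan(seq1, seq2, half_window=25):
    """Slide a window of 2 * half_window sites along two aligned sequences.

    At each position, matches and mismatches in the left half are compared
    with those in the right half. Returns (position, chi2) for the highest
    chi-square, or (None, 0.0) if no contrast is found.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = [int(x == y) for x, y in zip(seq1, seq2)]
    best_pos, best_chi2 = None, 0.0
    for pos in range(half_window, len(matches) - half_window + 1):
        left = matches[pos - half_window:pos]
        right = matches[pos:pos + half_window]
        chi2 = chi_square_2x2(sum(left), len(left) - sum(left),
                              sum(right), len(right) - sum(right))
        if chi2 > best_chi2:
            best_pos, best_chi2 = pos, chi2
    return best_pos, best_chi2
```

A sharp peak in the chi-square profile marks a position where similarity between the two sequences changes abruptly, which is the kind of signal that would then be checked by comparing trees on each side of the putative breakpoint.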

To test for evidence of selection in the evolution of L1
several methods implemented in the web server
www.datamonkey.com [81] of the HyPhy program [82] were
used. The first method uses a maximum likelihood approach
(PARRIS) to determine if a proportion of sites in an
alignment evolves with a ratio dN/dS > 1 [83]. A ratio
significantly > 1 is indicative of positive selection whereas a
ratio < 1 is indicative of purifying selection. The second
method, GABranch [84], can detect lineage-specific variation
in selective pressure and requires no a priori specification
of branches in a phylogeny that may have evolved
under different values of dN/dS. The dN/dS test is however
not very sensitive, particularly if selection acts on a
few codons. For this reason we used three methods
designed to detect the action of positive or negative selection
at specific sites in an alignment: Single Likelihood
Ancestor Counting (SLAC), Random Effects Likelihood
(REL), and Fixed Effects Likelihood (FEL) [85]. For each
dataset, the model that best fits the data was determined
using the tool available at datamonkey.com. As selection
detection methods are sensitive to recombination, we performed
our analyses independently for each segment of
L1 flanked by recombination breakpoints. Previous studies
on human L1 have documented positive selection in the
coiled-coil (CC) domain of ORF1 [30,39]. CC structures
are formed from two or more α-helical peptide chains that
contain a distinct arrangement of non-polar side chains
[62]. Domains that can form CC consist of heptads (or
seven residue repeats) with non-polar or hydrophobic
residues in the first and fourth positions. The program
COILS [62] was used to identify the position of the CC
domain in each consensus sequence as well as the number
of constitutive heptads.
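
The heptad rule stated above (non-polar or hydrophobic residues in the first and fourth positions) can be illustrated with a naive counter. This is not the COILS algorithm, which scores sequences against known coiled coils; the function name, the fixed register, and the set of residues treated as hydrophobic are assumptions made for the example.

```python
HYDROPHOBIC = set("AILMFVWY")  # assumed residue set, for illustration only


def count_canonical_heptads(protein, offset=0):
    """Count 7-residue blocks, read from `offset` in a fixed register,
    whose first and fourth residues are both hydrophobic."""
    count = 0
    for i in range(offset, len(protein) - 6, 7):
        heptad = protein[i:i + 7]
        if heptad[0] in HYDROPHOBIC and heptad[3] in HYDROPHOBIC:
            count += 1
    return count
```

Because an insertion or deletion that is not a multiple of seven residues changes the register, a fixed-register count like this one is sensitive to exactly the kind of length changes described for the LPR.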

Age and copy number of L1 families

The age of each subfamily was estimated by calculating
the average pairwise divergence based on the 3'UTR.
CpG dinucleotides and the highly mutable polypurine
tract located in the 3'UTR were removed from the
alignments. The average divergence between copies as well as
the standard error was calculated using the maximum
likelihood parameter distance (using the MEGA 5.01
software). Divergences were converted to time assuming
a neutral rodent genomic substitution rate of 1.1%/MY
(calculated using the data presented in Table 5 of [86]
and assuming a Mus/Rattus divergence at 13 MY [53]).
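
The conversion from divergence to time can be sketched as a simple division. The text does not spell out the exact formula, so this sketch assumes that age equals divergence divided by the substitution rate, with no additional factor for pairwise comparisons. The function name and the example divergence value are hypothetical; only the 1.1%/MY rate comes from the text.

```python
RATE_PERCENT_PER_MY = 1.1  # neutral rodent substitution rate quoted in the text


def divergence_to_age_my(divergence_percent, rate=RATE_PERCENT_PER_MY):
    """Convert an average divergence (%) to an age in MY.

    Assumes age = divergence / rate, which is an assumption of this sketch.
    """
    if divergence_percent < 0 or rate <= 0:
        raise ValueError("divergence must be >= 0 and rate must be > 0")
    return divergence_percent / rate


# Hypothetical input: a family whose copies differ by 2.2% on average.
print(round(divergence_to_age_my(2.2), 2))
```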

Availability of supporting data 

The consensus sequences are available in Repbase 
(http://www.girinst.org/repbase/).

Additional file 1: Alignment of mouse L1 consensus sequences
starting at the beginning of ORF1. ORF1 spans positions 1 to 1,218
and ORF2 spans positions 1,262 to 5,096.

Additional file 2: Matrix of pairwise divergence based on the 
longest non-recombining fragment of ORF2 (from position 2,085 to 
4,489 in Additional file 1). 
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Additional file 3: Alignments showing recombination break-points 
among L1 families. Only the parsimony-informative sites are shown. 
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